WEB INTERACTION SYSTEM WHICH ENABLES A MOBILE
TELEPHONE TO INTERACT WITH WEB RESOURCES

BACKGROUND OF THE INVENTION

1. Field of the invention

This invention relates to a web interaction system which enables a mobile telephone to
interact with web resources. It can, for example, be used by a mobile telephone to locate
and purchase goods (e.g. buy CDs, download music or images etc.) or services (e.g. buy
train tickets, place bets etc.). The web interaction system extracts information from web
sites and performs queries on that information to (for example) locate and/or purchase
goods and services of interest. The web interaction system is located in a server remote
from the mobile telephone and communicates with it over a wireless WAN, e.g. a GSM
network.

2. Description of the Prior Art 

Searching web resources using a mobile telephone has conventionally been done by a
user manually browsing different WAP (or iMode) sites. This restricts choice to a
relatively small subset of possible suppliers, i.e. those with wireless protocol enabled
sites. Further, because of the small screen size of mobile telephones, the user interaction
process is awkward and can involve many discrete steps, and is hence likely not to be
completed. Finally, the relatively low data connection speeds
of WAP and iMode mobile telephones can make the overall browsing process a slow
one. Next generation networks, such as GPRS and 3G, will partly address these
problems by offering faster connection speeds and mobile telephones with larger
screens. However, the inherent limitations of screen size and data connection speed will
still make the overall experience of interacting with web resources (e.g. to undertake
mobile commerce) on even a 2.5G or 3G mobile telephone far less appealing than with a
desktop machine connected to the Internet over a broadband link. It is therefore
possible that users of 2G mobile telephones (and possibly also even 2.5G and 3G
phones) will choose not to use their mobile telephones to search web resources.

In the Internet, 'web spiders' are well known: these are programs which automatically
visit large numbers of web sites and download their content for later analysis. Some web
spiders go beyond just passively reading content and can also submit simple, pre-defined
forms (e.g. giving a password in order to read an access controlled site). Spiders can also
be used to automate a real time enquiry from a user to locate goods or services, for
example by visiting a number of travel web sites to obtain the best price for airline travel
to a destination etc. defined by a user.

However, web spider functionality is still very limited and is typically only activated once
a user has reached a web portal/site. Since most mobile telephones inhibit their users
from even connecting to a web portal/site in the first place (because of user interaction
and data connection limitations, as explained above), web spiders have had little real
impact on mobile commerce undertaken using mobile telephones.

SUMMARY OF THE PRESENT INVENTION

In a first aspect of the invention, there is a web interaction system which enables a
mobile telephone to interact with web resources, in which the web interaction system
comprises a query engine which operates on XML format data obtained from content
data extracted from a web site, the query engine parsing the XML format data into SAX
events which are then queried by the query engine.

Conventional query engines parse XML into a Document Object Model (DOM) tree and not
SAX events; DOM trees have certain advantages over SAX events in that, once
constructed, they enable complex query processing by navigating through the DOM tree.
DOM trees can however occupy significant memory space. SAX events on the other
hand can be queried as parsing progresses (i.e. no need to wait for an entire DOM tree to
be constructed before queries can be first performed) and are also light on memory
(since no large DOM tree needs to be stored). Not needing to wait for an entire web
document to download is a major advantage since this would otherwise be a major
bottleneck. SAX events are method calls, e.g. Java software that calls code to perform
an instruction.
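
The event-driven behaviour described above can be illustrated with a minimal Java sketch using the standard JAXP SAX parser. The class name SaxPriceCollector, the price element and the sample XML are assumptions made purely for illustration; they are not taken from the Web Agents implementation. Each callback is a SAX event, and a result is recorded as soon as the closing tag of a matching element is seen, without any tree being built.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of every <price> element as events arrive. */
class SaxPriceCollector extends DefaultHandler {
    final List<String> prices = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a <price> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("price".equals(qName)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("price".equals(qName)) {
            prices.add(text.toString().trim()); // result available before parsing ends
            text = null;
        }
    }

    /** Parses the given XML and returns the collected price strings. */
    static List<String> collect(String xml) {
        try {
            SaxPriceCollector handler = new SaxPriceCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.prices;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<offers><offer><price>9.99</price></offer>"
                + "<offer><price>12.50</price></offer></offers>";
        System.out.println(collect(xml));
    }
}
```

Only the text of the element currently being read is held in memory, which is the property the paragraph above contrasts with a DOM tree.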

In one implementation of the present invention, querying the SAX events can then be 
done using an event stream based engine of an object oriented XML query language. 
This again differs from the conventional approach of using a relational (non object 
oriented) XML query language such as XQuery where the engine cannot operate on a 
stream of events and must keep the data in memory. The XML output which the query
engine operates on is derived from the source web site which is being
browsed/interrogated (e.g. for information relevant to goods/services to be purchased). 
That web site typically provides HTML format data, which is translated into valid XML 
using a translation engine. 

The translation engine can also fully define the nesting semantics (i.e. a parameterised list
of rules to handle bad nesting, which is very commonplace on web sites) needed for
efficient and valid XML. Nesting is sometimes not done in HTML code, but is done in
XML, so conventional HTML to XML translators address this problem by multiple
closure/re-opening steps, but this leads to very large XML nested structures. Defining
the nesting semantics allows for much more compact XML to be generated. The nesting
semantics typically cover what tags will open/close a nested structure, what hierarchies
of nesting are affected by what tags etc.

Another feature of an implementation of the present invention is that it uses an
extensible plug-in framework which allows plug-in components to be readily added to 
the framework. Typical plug-ins cover different parsers (e.g. SAX event output parsers 
as described above, as well as conventional DOM parsers), support for different 
protocols (e.g. HTTP and also HTTPS) and different query languages (e.g. Object 
oriented XML query languages). 

The term 'mobile telephone' covers any device which can send data and/or voice over a 
long range wireless communication system, such as GSM, GPRS or 3G. It covers such 
devices in any form factor, including conventional telephones, PDAs, laptop computers, 
smart phones and communicators. 

In one implementation, a mobile telephone user sends a request for goods and services
using a protocol which is device and bearer agnostic (i.e. is not specific to any one kind
of device or bearer) over the wireless network (e.g. GSM, GPRS or 3G) operated by a
mobile telephone operator (e.g. Vodafone). The request is directed to the operator, who
then routes it through to a server (typically operated by an independent company
specializing in designing the software running on such servers, such as Cellectivity
Limited), which initiates a search through appropriate suppliers by using the above
described web interaction system.

The web interaction system automates the entire web browsing process which a user 
would normally have to undertake manually. The user in effect delegates tasks to the 
web interaction system, eliminating the need for continued real time connection to the 
Internet. The search may also depend on business logic set by the operator - e.g. it may 
be limited to suppliers who have entered into commercial arrangements with the mobile 
telephone operator controlling the web interaction system.

The web interaction system interacts with web resources (not simply WAP, iMode or
other wireless protocol specific sites), querying them, submitting forms to them (e.g.
password entry forms) and returning HTML results to the translation engine. The
translation engine converts the HTML into properly nested XML by generating SAX
events; the query engine then applies appropriate queries to the SAX events in order to
extract the required information and generally interact with the web site in a way that
simulates how a user would manually browse through and interrogate the site in order to
assess whether it offers goods/services of interest and to actually order those
goods/services.

The objective is for the consumer experience to be a highly simplified one, using
predefined user preferences in order to make sure that the goods/services offered to the
consumer are highly likely to appeal. When the consumer is presented with
goods/services which are acceptable, he can initiate the purchase from the operator (and
not the supplier) using the mobile telephone by sending a request to the operator over
the wireless network operated by the operator.

A method of enabling a mobile telephone to interact with web resources, in which the 
method comprises the steps of: 

(a) extracting content data from a web site according to an instruction sent 
from the mobile telephone; 

(b) obtaining XML format data from the content data; 

(c) parsing the XML format data into SAX events;

(d) querying the SAX events using a query engine to generate query results;

(e) providing a response to the instruction sent from the mobile telephone 
using the query result.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be described with reference to the accompanying drawings in
which Figure 1 is a schematic representation of a Simple API for XML (SAX) API and
Figure 2 is a schematic representation of a Web Agents Framework API.

DETAILED DESCRIPTION 

The present invention is implemented in Web Agents technology from Cellectivity
Limited of London, United Kingdom. Web Agents technology is a framework that
allows easy, rapid and robust implementation of extremely lightweight software
components that automate browsing on the world-wide web. The main idea behind the
framework is to look at the web as a huge cluster of databases. It uses transfer protocol
support to link itself to and perform actions on such a "database". It also queries the
"database" using a query language, in order to extract information from it. The only
thing the agent programmer needs to code is the specific way to link to this "database"
and the specific structure for the data inside it.

The fundamental building blocks in the framework are

1. Transfer protocol handling support.

2. Parsing of content language support.

3. Querying content support.

By providing these three building blocks and linking them to one framework unit, Web
Agents makes it possible to fully interact with any website: link to it, parse its content
and query its content. The framework is written in Java and is built on top of the Java
API for XML Processing (JAXP) and in particular the Simple API for XML (SAX). The
use of the SAX standard enables better integration of the framework into other products
and a very simple integration of any SAX functionality into the framework.

By using the Web Agents framework, the programmer has a complete solution to any
activity she wishes to automate on the web. The generated agents are not limited to
information extraction or web crawling, for example. There is no limit to any specific
activity, specific transfer protocols or specific set of content languages.

Another advantage of the framework is its modularity. Every block implementation can
be easily plugged in and out of the system. 

The Web Agents implementation provides the complete framework design and interfaces,
plus implementations of support for several transfer protocols (FILE, HTTP, HTTPS),
several content languages (XML, HTML, JAVASCRIPT) and a query engine for a new
XML query language called Xcomp.

One major decision taken for Web Agents was to use the XML standard even when the
content itself is not XML. XML is the universal format for structured data on the Web.
Because Web Agents looks at any document on the web as if it was data, using XML fits
naturally into the framework. Translating different languages to XML may not be an easy
task. In particular, when it translates HTML it needs defined rules about what to generate
when the HTML code is not valid XML. Handling such behaviour in a generic way adds
to the parser's robustness. The solution to this problem is covered in section 3 (HTML
parser).



The decision to use an event-based parser in the framework (use of the SAX API) is linked
to the demand to create lightweight agents. Keeping a whole XML tree of a web page in
memory for every agent instance means not only too much memory but also too slow
processing. The main bottleneck of web applications is the web connection. Using the stream
based approach minimizes this by getting query results as the page is retrieved from the
remote site and not only after it was fully retrieved and the data structure was
constructed. This means that we can only use a query engine that works efficiently with a
SAX stream. Currently, no such engine exists for any XML query language and in most
cases the language itself requires the whole tree in order to evaluate one of its queries.
(Specifically, the query languages that are based on relational algebra require a full table
to perform any query processing.) That is why, in the present invention, a new
query language (Xcomp) has been designed with its engine built on top of the SAX
interfaces. Note that the Xcomp engine is not part of the whole framework and a
different implementation of a query engine for the same or different language can be
easily plugged in to the framework.

With Xcomp and its engine, performance is optimal. Using a different language or engine
may affect the efficiency (both memory and speed) of the agent.

This Detailed Description covers the framework definition in section 2. Then follows a
description of two important non-standard components built for the framework. The
Non-Strict-HTML Parser is covered in section 3. Section 4 describes the Xcomp query
language and its implementation on top of an event based (SAX) parser.



2. Web Agents Framework 
2.1 Overview 

The framework is composed of Agent objects which create and run Pagent objects.

A Pagent is a component which controls the interaction of an agent with a specific type
of page on the web. It contains all the implementation of the interaction with that page,
meaning all the calls to the 3 different block instances (protocol, parser and query).

An Agent is a component which controls the flow between one or more pagents (and
thus simulates browsing between a specific sequence of pages).

Figures 1 and 2 are schematic overviews, with Figure 1 showing the SAX API standard
and Figure 2 the Web Agents Framework API.

In order to run, a Pagent needs a URL that defines its page and a set of Query Handlers
which define the queries we would like to perform on the page's content. Using the
Factory design pattern, the Pagent gets hold of the specific protocol handler and the
content parser it needs for its page. This process is done dynamically. The Protocol
Handler Factory depends only on the URL to produce a Protocol Handler. The Content
Parser Factory can depend on MIME type or file name suffix to produce a Content
Parser. A Content Parser is simply the org.xml.sax.XMLReader interface
(SAXReader in Figure 1). A QueryHandler is simply the org.xml.sax.ContentHandler
interface and it is implemented by any query engine. The framework is built on top of JAXP and
therefore any content parser the framework accepts is a JAXP SAX Parser. The query
handler links to this parser as its org.xml.sax.ContentHandler, the object for the callback
SAX events.
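
The wiring between a content parser and a query handler can be sketched in Java as follows. This is only an illustration under stated assumptions: PagentWiringSketch, CountingQueryHandler and countElements are hypothetical names, and the stock JAXP parser stands in for whatever content parser the framework's factory would return. Only the standard org.xml.sax interfaces named in the text are relied upon.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class PagentWiringSketch {

    /** Hypothetical query handler: counts elements with a given name. */
    static class CountingQueryHandler extends DefaultHandler {
        private final String elementName;
        int count;

        CountingQueryHandler(String elementName) {
            this.elementName = elementName;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (elementName.equals(qName)) {
                count++;
            }
        }
    }

    /** Plugs the query handler into the content parser, then parses the page. */
    static void run(XMLReader contentParser, ContentHandler queryHandler, InputSource page) {
        try {
            contentParser.setContentHandler(queryHandler); // handler receives the SAX callbacks
            contentParser.parse(page);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Convenience wrapper using the default JAXP SAX parser as the content parser. */
    static int countElements(String xml, String name) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CountingQueryHandler handler = new CountingQueryHandler(name);
            run(reader, handler, new InputSource(new StringReader(xml)));
            return handler.count;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements("<r><a/><b><a/></b></r>", "a"));
    }
}
```

Because run depends only on the XMLReader and ContentHandler interfaces, a different parser or query engine could be substituted without changing it, which is the modularity the text describes.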

A ProtocolHandler is an interface that supports the manipulation of its transfer
protocol parameters. It also wraps a java.net.URLConnection object and provides its
functionality. Finally it links to an Environment object used by the agent, thus enabling
the agent programmer to persist the browsing state. The environment is a member of
every web agent.

Both Agent and Pagent can be written directly as a Java class or generated from a script.
In section 4.3 we cover our Xcomp implementation including the generation of Pagent 
code from an Xcomp script. 

2.2 Framework Extensions 

The Web Agents framework is very generic. On top of the framework, any user can build
extensions, i.e. implementations of common generic actions on web sites. A good example
of such an extension is the form filler.

HTML form filling and submission is a simple HTTP request which is constructed from
the data retrieved by a specific query after an HTML parser has parsed the form page.
Note that this form filling capability is just a single case covered by the framework.
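
As a rough illustration of the last step of form filling, the following Java sketch builds the URL-encoded body of a form submission from field names and values. It assumes the fields have already been extracted from the form page by a query; the class name FormFillerSketch and the field names are hypothetical and do not come from the Web Agents code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormFillerSketch {

    /** Encodes form fields as application/x-www-form-urlencoded, keeping field order. */
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("user", "jo smith");   // hypothetical field names and values
        fields.put("pass", "p&1");
        System.out.println(encodeForm(fields));
    }
}
```

The resulting string would be sent as the body of a POST request, or appended to the URL after a '?' for a GET request.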

2.3 Framework APIs
2.3.1 ProtocolHandler

com.cellectivity.protocol
Class ProtocolHandler

java.lang.Object
  |
  +-com.cellectivity.protocol.ProtocolHandler

public abstract class ProtocolHandler
extends java.lang.Object

Super class of all protocol handlers. This class implements generic functionality shared by
all handlers. It holds a java.net.URLConnection as a member and uses it to connect to
the web and control the connection protocol.

ProtocolHandler(java.net.URLConnection i_conn,
    com.cellectivity.agent.pagent.PagentRequest i_request)
  All protocol handlers must have such a constructor in order to be created by the
  ProtocolHandlerFactory.

java.lang.String getContentType()
  Get the content type of this connection.

abstract org.xml.sax.XMLReader getDefaultParser()
  Return the default parser for this protocol.

abstract java.lang.String getDefaultParserName()
  Return the name of the default parser for this protocol.

java.util.HashMap getResponseHeader()
  Get all response headers.

java.util.HashMap getResponseHeader(java.util.HashMap resultMap)
  Fill the hash map with all response headers.

java.lang.String getResponseHeaderField(java.lang.String field)
  Get the value of the first occurrence of the response header defined by the input param.

java.lang.String[] getResponseHeaderFields(java.lang.String field)
  Return all values for this response header field.

java.net.URL getURL()
  URL of this connection.

org.xml.sax.InputSource resolveInputSource()
  Connect to the remote site and return the input stream.



2.3.2 ProtocolHandlerFactory

com.cellectivity.protocol
Class ProtocolHandlerFactory

java.lang.Object
  |
  +-com.cellectivity.protocol.ProtocolHandlerFactory

public class ProtocolHandlerFactory
extends java.lang.Object

Create a protocol handler according to the default class name
"com.cellectivity.protocol.Handler.class" and, if no class is found or an error occurred, try to
look for a name in the global config (key = "protocol//handler").

ProtocolHandlerFactory()

static com.cellectivity.protocol.ProtocolHandler
createProtocolHandler(com.cellectivity.agent.pagent.PagentRequest i_request)
  Create a Protocol Handler from the given Pagent Request.
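
The fallback behaviour described for this factory (try a default class name first, then a class name taken from configuration) can be sketched with standard Java reflection. Everything named here is an assumption for illustration: HandlerFactorySketch, the Handler interface, DefaultHttpHandler and the configuration key are hypothetical, and a plain Map stands in for the global config.

```java
import java.util.HashMap;
import java.util.Map;

class HandlerFactorySketch {

    /** Stand-in for a protocol handler type. */
    interface Handler {}

    /** Hypothetical handler used in the example below. */
    static class DefaultHttpHandler implements Handler {}

    /** Tries the default class name first, then the name stored under configKey. */
    static Handler create(String defaultClassName, String configKey, Map<String, String> config) {
        Handler handler = instantiate(defaultClassName);
        if (handler == null) {
            handler = instantiate(config.get(configKey));
        }
        return handler; // null when neither lookup succeeds
    }

    /** Returns a new instance of the named class, or null if it cannot be created. */
    private static Handler instantiate(String className) {
        if (className == null) {
            return null;
        }
        try {
            Object o = Class.forName(className).getDeclaredConstructor().newInstance();
            return (o instanceof Handler) ? (Handler) o : null;
        } catch (ReflectiveOperationException | LinkageError e) {
            return null; // class missing or not instantiable: caller falls back
        }
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("protocol/handler", DefaultHttpHandler.class.getName());
        Handler h = create("no.such.Handler", "protocol/handler", config);
        System.out.println(h instanceof DefaultHttpHandler);
    }
}
```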



2.3.3 ParserFactory

com.cellectivity.content
Class ParserFactory

java.lang.Object
  |
  +-com.cellectivity.content.ParserFactory

public class ParserFactory
extends java.lang.Object

Create a parser for a specific content type. The content is defined by either name (allowing
the programmer to override), mime type or filename suffix.

The factory looks in the environment object for the values of the properties of the
format:

"content/sax/parser/*"
"content/sax/parser/mime/*"
"content/sax/parser/suffix/*"
where * is the value it has for that particular property type.

It looks for the values in that order and returns the parser whose class name is the value
of that name. If none is found or some error occurred, it returns the default parser defined
for each protocol handler.

ParserFactory()

static org.xml.sax.XMLReader createParser(java.lang.String contentName,
    com.cellectivity.protocol.ProtocolHandler ph,
    com.cellectivity.agent.www.Environment env)
  Create a parser according to the algorithm described above.

static org.xml.sax.XMLReader createParser(java.lang.String contentName,
    com.cellectivity.protocol.ProtocolHandler ph,
    com.cellectivity.agent.www.Environment env,
    java.lang.String def)
  Override the default parser using this method.
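
The lookup order described for the ParserFactory (name, then mime type, then filename suffix, then the default) can be sketched as below. This is a simplified illustration, not the factory's actual code: a plain Map stands in for the Environment object, the method returns a class name rather than an XMLReader instance, and ParserLookupSketch and the example class names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ParserLookupSketch {

    /**
     * Resolves the parser class name by checking, in order, the name, mime type
     * and suffix properties; falls back to defaultParser when none is set.
     */
    static String resolveParserClass(Map<String, String> env, String contentName,
                                     String mimeType, String suffix, String defaultParser) {
        String[] keys = {
            contentName == null ? null : "content/sax/parser/" + contentName,
            mimeType == null ? null : "content/sax/parser/mime/" + mimeType,
            suffix == null ? null : "content/sax/parser/suffix/" + suffix
        };
        for (String key : keys) {
            if (key == null) {
                continue; // this kind of information is not available for the page
            }
            String className = env.get(key);
            if (className != null) {
                return className; // first match wins
            }
        }
        return defaultParser;
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        env.put("content/sax/parser/mime/text/html", "example.HtmlParser"); // hypothetical class
        System.out.println(resolveParserClass(env, null, "text/html", "html", "example.DefaultParser"));
    }
}
```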



2.3.4 PagentRequest

com.cellectivity.agent.pagent
Class PagentRequest

java.lang.Object
  |
  +-com.cellectivity.agent.pagent.PagentRequest

All Implemented Interfaces:
java.io.Serializable

public class PagentRequest

WO 03/014971 PCT/GB02/03702



extends java.lang.Object
implements java.io.Serializable

A request object to a Pagent. This object wraps together the agent environment, a URL,
a timeout value and a generic additional data object.

See Also:
Serialised Form

PagentRequest(java.lang.String i_url,
    com.cellectivity.agent.www.Environment i_env,
    long i_timeout)

PagentRequest(java.lang.String i_url,
    com.cellectivity.agent.www.Environment i_env,
    long i_timeout,
    java.io.Serializable i_additionalData)

java.io.Serializable getAdditionalData()

com.cellectivity.agent.www.Environment getEnv()

long getTimeout()

java.lang.String getUrl()

void setAdditionalData(java.io.Serializable i_additionalData)

void setEnv(com.cellectivity.agent.www.Environment i_env)

void setTimeout(long i_timeout)

void setUrl(java.lang.String i_url)



2.3.5 Environment

com.cellectivity.agent.www
Class Environment

java.lang.Object
  |
  +-com.cellectivity.agent.www.Environment

All Implemented Interfaces:
java.io.Serializable

public class Environment
extends java.lang.Object
implements java.io.Serializable

General Environment object for an agent.

See Also:
com.cellectivity.protocol.http.HttpEnvironment, Serialised Form

Environment()
  Create a new empty environment.

Environment cloneEnvironment(java.lang.String referer)
  Clone this environment and set the referer to be 'referer'.

java.lang.Object getParameter(java.lang.String key)

java.util.HashMap getParameters()

java.util.SortedMap getProperties(java.lang.String pathKey)

java.util.Iterator getPropertiesKeys(java.lang.String pathKey)

java.util.Iterator getPropertiesValues(java.lang.String pathKey)

java.lang.String getProperty(java.lang.String key)

java.net.URL getReferrer()

java.lang.String removeProperty(java.lang.String key)

void setParameter(java.lang.String key, java.lang.String value)

void setParameters(java.util.HashMap params)

void setProperty(java.lang.String key, java.lang.String value)

void setReferrer(java.lang.String referrer)

void setReferrer(java.net.URL referrer)




2.3.6 Pagent

com.cellectivity.agent.pagent
Interface Pagent

public interface Pagent

A Pagent from the point of view of the programmer. Any implementation of this
interface is a specific behaviour for a specific Pagent. This can include some generic
behaviour for a type of queries (preferably done inside an abstract class that will be
subclassed by specific queries of that type) or handling of specific query results on a
page.

void service(com.cellectivity.agent.pagen...)
  This is the method which the Agent of this Pagent calls to process a request.

com.cellectivity.agent.www.Environment getEnv()
  Gets the env attribute of the Pagent object.

com.cellectivity.agent.pagent.PagentRequest getPagentRequest()
  Gets the PagentRequest attribute of the Pagent object.

java.lang.String getParserName()
  Gets the parserName attribute of the Pagent object.

com.cellectivity.message.ProcessingContext getProcessingContext()
  Gets the processingContext attribute of the Pagent object.

long getTimeout()
  Gets the timeout attribute of the Pagent object.

java.lang.String getUrl()
  Gets the url attribute of the Pagent object.

void init()

java.lang.Runnable pluginQuery(com.cellectivity.query.QueryListener i_queryListener,
    com.cellectivity.util.logging.Logger i_logger)
  Prepare the specific query or queries before we start parsing the content and plug
  them into the XMLReader.

void setRequest(com.cellectivity.message.ProcessingContext i_context,
    com.cellectivity.agent.pagent.PagentRequest i_request)
  Set the request to the pagent.



2.3.7 Agent

com.cellectivity.agent.www
Class Agent

com.cellectivity.agent.www.Agent

public abstract class Agent
An Agent that accesses pagents and controls browsing on the web.

Agent()
  Construct an Agent with no initial environment.

Agent(com.cellectivity.agent.www.Environment i_env)
  Construct an Agent with an initial environment.

com.cellectivity.protocol.ProtocolHandler
connectAndForget(com.cellectivity.agent.pagent.PagentRequest i_request)
  Send a request for a page to a remote host and don't bother to parse the reply.

com.cellectivity.agent.www.Environment getEnv()
  Gets the env attribute of the Agent object.

com.cellectivity.agent.pagent.PagentRequest getPagentRequest(byte i_method,
    java.lang.String i_url,
    java.lang.String[] i_keys,
    java.lang.String[] i_values,
    long i_timeout)
  Return a PagentRequest to visit the url and pass the params.

com.cellectivity.agent.pagent.PagentRequest getPagentRequest(byte i_method,
    java.lang.String i_url,
    java.lang.String[] i_keys,
    java.lang.String[] i_values,
    long i_timeout,
    java.net.URL i_referer)
  Return a PagentRequest to visit the url and pass the params.

com.cellectivity.agent.pagent.PagentRequest getPagentRequest(java.lang.String i_url,
    long i_timeout)
  Return a simple PagentRequest to visit the param url.

void setEnv(com.cellectivity.agent.www.Environment i_env)
  Sets the env attribute of the WebBrowsingAgent object.

void setReferrer(java.net.URL i_referrer)

3. HTML Parser

HTML documents are the most common type of document on the web and they
probably have at least one of the following differences which make them non-valid XML
documents.

1. They contain elements that start and do not close.

2. They contain bad nesting.

3. There is no root element.

4. XML element names are case sensitive (XHTML requires lower case), whereas HTML
is case insensitive.

5. Element attributes are not always quoted and some attributes contain no value at
all.

6. HTML contains tags whose content is defined as plain text, which is not available
in XML.

This prevents us from using an XML parser to parse HTML files. The great variety of,
and differences between, the HTML 4.0 specification, browser extensions and finally
the non-valid HTML code in many sites that browsers accept as valid also prevent us
from writing a strict HTML parser for the language. Web Agents requires a fast and
robust syntactic parser which will parse a special form of HTML called Non-Strict-HTML.
The implementation has three unique features.

1. The parser is not strict. It does not expect valid HTML. It does not work
according to any DTD and does not check the validity of any tag or attribute. It
parses whatever is on the page, meaning it only identifies tags, comments, text
etc.

2. The parser implements org.xml.sax.XMLReader, so it fits into the SAX API.

3. Because it translates HTML to XML, it has a parameterized list of rules to handle
bad nesting (a very common case in HTML on the web).



Other differences between XML and HTML are resolved according to the table below. 



No. | Description | Non XML-valid HTML example | Parser conforms to
----|-------------|----------------------------|-------------------
1 | Document well-formedness | <a><b></a></b> | <a><b></b></a><b></b>
2 | No root element | missing <html> </html> | Wrapping all document in our own root element. (XHTML: adding <html> </html> (and risking there's another one soon))
3 | Element and attribute names in lower-case | <AbC> | <abc>
4 | For non-empty elements, end tags are required | <p> paragraph... <p> | <p> paragraph... </p><p> ...</p> (Done for a pre-defined set of elements whose end-tags are ignored)
5 | Attribute values must be quoted | <table border=3> | <table border="3">
6 | Attribute minimization | <option selected> | <option selected="selected">
7 | Empty elements | <br> | <br />
8 | Whitespace handling in attribute values | <input value=" my v alue "> | No change (accept it as is) (XHTML: <input value="my v alue">)
9 | Script and Style elements | <script> unescaped script content ...</script> | <script><![CDATA[... unescaped script content ...]]></script>
10 | SGML exclusions | <a><a></a></a> see Appendix B of XHTML for a full list | Accept it as is (XHTML: issue warning as in all cases)
11 | The elements with 'id' and 'name' attributes | <a name="myName"> | No change (accept it as is) (XHTML: <a id="myName">)
12 | <!DocType> SGML declaration | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 transitional//EN"> | Accept it as a tag with elements (XHTML: SGML Declaration)
13 | Comments | <!-- comment --> | Omit them
14 | STYLE, SCRIPT, SERVER, COMMENT, PLAINTEXT, XMP, TEXTAREA code | <SCRIPT>if (0<1) etc.. </SCRIPT> | <script><![CDATA[... unescaped script content ...]]></script>

HTML Parser behaviour for specific XML validity problems in HTML.



The parser follows the same lines as the org.xml.sax.XMLReader with several minor
changes: The extended SAX APIs LexicalEventListener and DTDEventListener are
ignored (it does not validate the code). A new listener, NonStrictParsingListener, has
been introduced to mark events where the parser had to modify the original content in
order to remain valid or had to ignore content in order to remain valid. In order to be as
efficient as possible, the amount of NonStrictParsing events this parser fires is limited to
cases where no error event is fired.

The parser is highly configurable and its rules of nesting can be modified according to
specific erratic behaviour of different sites. The main idea is that whenever we encounter
badly nested elements we can decide what to do according to those elements and this will
affect the generated XML. For example, one of the options is to define elements as
block tags and then everything inside them will be closed when their scope ends. If an
element is not defined as a block, the open elements will need to be closed (we must
have valid nested tags), but then they will re-open after the element's scope has ended.

See Appendix III for an example of scope rules for the HTML Parser. 
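
The block / non-block rule can be illustrated with a small stack-based sketch. It is not the parser described in this document: the class name NestingFixer, the token format ("<name>" and "</name>" strings) and the decision to drop stray close tags are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: repairs badly nested tags using block / non-block rules. */
final class NestingFixer {
    private final Set<String> blockTags;

    NestingFixer(Set<String> blockTags) {
        this.blockTags = blockTags;
    }

    /** Tokens are "<name>" or "</name>"; the result is a well-nested token list. */
    List<String> fix(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();   // innermost open element on top
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!token.startsWith("</")) {
                open.push(token.substring(1, token.length() - 1));
                out.add(token);
                continue;
            }
            String name = token.substring(2, token.length() - 1);
            if (!open.contains(name)) {
                continue;                          // assumption: ignore stray close tags
            }
            // Close everything still open inside the element being closed.
            Deque<String> interrupted = new ArrayDeque<>();
            while (!open.peek().equals(name)) {
                String inner = open.pop();
                out.add("</" + inner + ">");
                interrupted.push(inner);
            }
            open.pop();
            out.add(token);
            // A block tag ends the scope of its content; otherwise re-open it.
            if (!blockTags.contains(name)) {
                while (!interrupted.isEmpty()) {
                    String inner = interrupted.pop();
                    open.push(inner);
                    out.add("<" + inner + ">");
                }
            }
        }
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");
        }
        return out;
    }
}
```

With "td" as the only block tag, the input <b><i></b></i> becomes <b><i></i></b><i></i> (the interrupted <i> is re-opened), while <td><b></td> becomes <td><b></b></td> (the block tag ends the scope of <b>).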

4. Xcomp

4.1 Overview

Xcomp is a query language for XML content. It is based on a research OODB query
language called Comprehension (or COMP for short) and on XPath for applying the
queries to XML. Xcomp's strength lies in the fact that it is adapted to the object
oriented nature of XML and that its definition and functionality allow a very efficient
implementation based on a parsed stream of events (SAX), which does not require the
parsing of the whole XML document in order to start returning results. This matters a
great deal when dealing with the web, where waiting for a whole document to download
and holding all of it in memory is simply too heavy for the process. The remainder of
this document introduces the language syntax and semantics. Then the compiler and
engine are described.

4.2 Syntax & Semantics

The Xcomp language syntax is based on COMP, where the variable declarations are
done using XPath-like expressions.

See Appendix I for the Xcomp BNF grammar.

4.2.1 Select & Where

Every expression is surrounded by curly braces '{}' and is split into two parts by a
vertical line '|'.

To use SQL terminology, the left side of the expression is the select part and the right
side is the where part.

{ select | where }

Xcomp query basic syntax

This is the basic syntax for every COMP expression and is borrowed from the definition of
a set in set theory.

The select is one or more expressions.
The where part is split into variable declarations and conditions.

4.2.2 Query Results 

A result of an Xcomp query is defined as a set (within the framework, this set is
delivered as a sequence of events).

Each element in this set (or each event) is the list of all the values of the select part,
computed from the variables evaluated in the where part, and is produced only if all the
conditions in the where part are true.

A result is evaluated when the scope of a range expression variable is closed. (See section 
4.2.4 for an explanation about range expressions). 

4.2.3 Expressions

An Xcomp query is composed of expressions which are evaluated by the engine.
These expressions may appear in the select part to define a value we want to select. They
can appear as the declaration of a variable, or they can appear as a value inside a condition
to which a relational or conditional operator is applied.
An expression can be one of the following types:

1. A constant.

2. A Pagent parameter (its value depends on the context in which the query runs). 

3. An XPath-like expression 

4. A variable object. 

5. A member field of a variable object. 

6. A method call on a variable object 

An XPath-like expression is a subset of the XPath definition. In Xcomp, we define
the path using the separators '/' or '//', tag names and the Kleene star '*'.
For 'a/b' to match a path, 'b' should be nested directly below 'a'.
For 'a//b' to match a path, 'b' should be nested somewhere inside the scope of 'a'.
A Kleene star means any element.

The path can start with a variable or with a separator. If it starts with a variable then the
root of the path will be this variable's value. A separator means the path's root is the root
of the document. Any path can contain inner conditions inside brackets '[]'. Those
conditions can also be general (using the variables in scope) but usually they will be
specific to that path's element.

A member field of an object applies only to objects of type ELEMENT. Calling
"x.foo" is simply a shorthand for using the method x.getAttrValue("foo").
A method is defined on a variable. The method needs to be declared outside
the Xcomp expression.

Xcomp allows using any Java method for any object. (See section 4.3.1 for more details). 



NULL                                     - Constant
TRUE                                     - Constant
5                                        - Constant
param("NameOfForm")                      - Parameter
x                                        - Object (x is a variable previously declared)
x.href                                   - Object field (x is a variable previously declared)
x.perl5Split("\\s(.)\\s")                - Object method (x is a variable previously declared)
//a/b/*/*/d                              - XPath-like expression
//a/b[.width > 20]//c/d[.size == "100%"] - XPath-like expression
x//a//b                                  - XPath-like expression (x is a variable previously declared)

Xcomp expressions examples




4.2.4 Variables 

There are two types of variable declarations in Xcomp: 

1. A simple assignment (marked ':=', as in Pascal syntax).

2. A range expression (marked '::='). Range expression variables are declared
using an XPath-like expression.

An assignment defines a variable by giving the expression that evaluates this variable's
value.
A range expression must be defined only by a path and therefore its type will always be
ELEMENT. Any range expression in an Xcomp query defines not only the variable's
value but also when the engine will try to evaluate a result for the query. This is how the
query programmer can define iteration over a set of values in the XML source (such as a
list of prices on a site or a list of search result links). If the query programmer is only
after one matched pattern, this rule still applies, and the pattern to be matched by the
query must be defined so that there is only one match. This is good practice with
structured data such as XML, and the language enforces it.

The scope of any variable is the select area and the where area immediately to the right of
its declaration. This definition prevents expressions with several variables of the same
name and also prevents a deadlock where the values of different variables depend on each
other.



[Figure: Variable declaration scope. The example query declares x::=//table and marks the
scope of variable 'x'.]

4.2.5 Types 

The Xcomp language has five main variable types:

1. STRING.

2. INTEGER.

3. BOOLEAN.

4. ELEMENT - An XML element (com.cellectivity.content.Element) - name
and attributes.

5. OBJECT.

The language contains integer constants (defined as NUMBER - one or more digits),
boolean constants (TRUE or FALSE) and String constants (defined by double
quotes around the string text). It also contains NULL - the null keyword.
In addition to these main types, any Java type can be integrated into the language. The
language core treats those types as Object, but the type checking at compile time takes note
of the specific type and will fail to compile the Xcomp classes if there is a mismatch.
Types are used for two things: type checking at compile time and type casting at
runtime. Type checking is strict only at compile time, where it looks for type conflicts and
may fail to compile because of them.

The strict typing applies to the results of the query, which are assigned to a specific type
that appears in the method declaration of the listener to the query engine (the
Pagent). Strict typing is also used to check the method calls inside an Xcomp expression.
A method can be called only on a specific type. During runtime, the engine will always
try to perform a cast from one type to the other.

Type casting rules:

1. Element --> String: Using e.text();

2. Object --> String: Using o.toString();

3. String --> Integer: Using Integer.parseInt(s);

4. String --> Boolean: Using Boolean.valueOf(s);

5. Integer --> Boolean: Using (0 == integer);

6. NULL --> Integer: -1;

7. NULL --> Boolean: False;

These rules allow the engine to convert types at runtime and solve mismatches. They can 
also be applied more than once so converting an object to an Integer is done by 
converting it into a String and then the String into an Integer. 
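
A minimal sketch of how these rules could be chained at runtime is given below. The names XcompCasts and TextElement are hypothetical (TextElement stands in for the ELEMENT type, whose real class is not reproduced here), and rule 5 is coded exactly as it is printed in the list above.

```java
/** Hypothetical stand-in for an ELEMENT value; only its text is needed here. */
interface TextElement {
    String text();
}

/** Sketch of the runtime casting rules listed above. */
final class XcompCasts {
    private XcompCasts() {
    }

    /** Rules 1 and 2: Element --> String and Object --> String (value must not be null). */
    static String toStringValue(Object value) {
        if (value instanceof Integer) {
            // The text states that an Integer is never converted into a String.
            throw new ClassCastException("Integer is not converted to String");
        }
        if (value instanceof TextElement) {
            return ((TextElement) value).text();
        }
        return value.toString();
    }

    /** Rules 3 and 6, chaining through String for any other value. */
    static int toInteger(Object value) {
        if (value == null) {
            return -1;
        }
        if (value instanceof Integer) {
            return (Integer) value;
        }
        // Throws NumberFormatException for text such as "20%".
        return Integer.parseInt(toStringValue(value));
    }

    /** Rules 4, 5 and 7, chaining through String for any other value. */
    static boolean toBoolean(Object value) {
        if (value == null) {
            return false;
        }
        if (value instanceof Boolean) {
            return (Boolean) value;
        }
        if (value instanceof Integer) {
            return 0 == (Integer) value;   // rule 5 exactly as printed in the list
        }
        return Boolean.valueOf(toStringValue(value));
    }
}
```

Here toInteger applied to an element whose text is "50" goes through rule 1 and then rule 3, which is the two-step conversion described above.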

Non strict typing allows more flexibility and adds functionality to the language. 

For example: 

{ x | x::=//table , x.width > 50 AND x.height < 20 }



Xcomp query where casting is needed 
This expression is perfectly legal even though there seems to be a type conflict in
comparing an integer to a string. If the width attribute value cannot be translated to an
integer value (let's say "20%"), the condition will throw an exception. This is why this
expression should produce a warning at compile time, to alert the programmer that a type
conflict might occur. Note that not all possible conversions are made, only those that
give the programmer flexibility. For example, an Integer will not be converted into a
String. Using an integer in a regular expression operator will result in an Xcomp
compilation error.

4.2.6 Conditions 

Xcomp conditions are simply a convenient common syntax for methods with a Boolean
return type. The Xcomp language supports all the common equality operators and,
because of its pluggable nature, the user can easily introduce new Boolean methods, that is,
new conditions. One group of conditions that was widely used in our implementation, and
was added to the language as operators, is pattern matching using regular expressions.
We introduced the operators MATCH, CONTAINS, -MATCH and -CONTAINS ('-'
means case insensitive) as operators in the language.

The integration of regular expressions into Xcomp is a natural progression that adds a lot
of power to the language and fits into the stream based approach, where the queries are a
sort of structured data pattern match.

The Xcomp language also supports the use of parentheses '()' and the Boolean operators
AND, NOT and OR.
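
As an illustration only, the four pattern operators could be mapped onto Boolean methods like this. The sketch assumes that MATCH tests the whole text and CONTAINS tests any part of it, and it uses java.util.regex as a stand-in for whichever regular expression library the implementation used; the class name RegexConditions is hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical Boolean methods behind MATCH, CONTAINS, -MATCH and -CONTAINS. */
final class RegexConditions {
    private RegexConditions() {
    }

    /** MATCH (ignoreCase = false) or -MATCH (ignoreCase = true): whole text must match. */
    static boolean match(String text, String regex, boolean ignoreCase) {
        return compile(regex, ignoreCase).matcher(text).matches();
    }

    /** CONTAINS or -CONTAINS: the pattern may occur anywhere in the text. */
    static boolean contains(String text, String regex, boolean ignoreCase) {
        return compile(regex, ignoreCase).matcher(text).find();
    }

    private static Pattern compile(String regex, boolean ignoreCase) {
        return Pattern.compile(regex, ignoreCase ? Pattern.CASE_INSENSITIVE : 0);
    }
}
```

Because each operator is just a method returning a boolean, a new condition can be added in the same way as any other pluggable method.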

4.2.7 Examples

Below are some example Xcomp queries. Their descriptions borrow the meaning of the
elements from the HTML language.

{ x.href | x::=//a }

Return all hyperlinks on a page.

{ x.ref() | x::=//a }

Return the result of the ref method for all hyperlinks on a page.

{ x.href | x::=//a , x.href CONTAINS "http://foo *" , x.alt != NULL }

Return all hyperlinks on a page whose href contains the regular expression "http://foo *"
and whose alt attribute exists.

{ y | x:=//table , y::=x/tr/td , x.name == "foo" }

Return all table cell (td) elements in a table named "foo".

{[ x.ref() , x.text() , y.text() ] |
x::=//a , y::=x//b[.text() MATCH - ^\s*r/| MATCHES BY^ ,
(x.alt == "Click here for details" AND (x.class == "litebgartist" OR x.class ==
"litebgride"))
}

Return a list of the ref method result on a hyperlink, the text of the hyperlink and the text
of a bold tag.



4.3 Xcomp Language implementation 

This section covers our implementation of the Xcomp language. This is not part of the
language definition, but many of the Xcomp advantages are derived from this
implementation. Xcomp queries are defined in Xcomp files. Those files are 'compiled'
into Java source files of the specific engine for every query.
Appendix II contains the BNF Grammar for Xcomp files.

4.3.1 Methods Declarations

An Xcomp method can be any Java method. There is no interface to implement and no
special guidelines to follow. We link it to Xcomp by describing it in the
Xcomp configuration (or dynamically in the Environment). The only thing the Xcomp
engine needs is a mapping between the location of every method that we want to use and
the actual method information: the signature, the objects it operates on and some
additional flags for optimization purposes. In order to write methods for Java types such
as String, the descriptive mode also allows us to define a static method where the object it
operates on is given as the first parameter.

The method types in this configuration can be any Xcomp type (OBJECT, ELEMENT,
STRING, INTEGER or BOOLEAN) or any Java type (for example, java.lang.String).
This is important for the type checking of the methods. If a method is defined to return
STRING, then, following the Xcomp dynamic casting rules, the type check will pass
even if the value of the method is later used as an INTEGER. If, however, the return
type was java.lang.String, the type check would fail. The same is also true for the type of
object on which the method is defined.



Class ELEMENT;

// index of variable appearances in the doc.
Signature index()
Location com.cellectivity.query.xcomp.DocumentElement.getVarValueIndex()
ReturnType INTEGER;

Class STRING;

// Returns trimmed string with 1 space instead of every whitespace seq.
Signature htmlText()
Location com.cellectivity.query.xcomp.html.Util.htmlText(this)
ReturnType STRING
SaveText true;

Signature substring(INTEGER)
Location java.lang.String.substring(int)
ReturnType STRING
SaveText true;

Signature substring(INTEGER, INTEGER)
Location java.lang.String.substring(int, int)
ReturnType STRING
SaveText true;
Method declarations examples
In the above example there are four methods defined:

• index(), which operates on an Xcomp ELEMENT type and returns an
INTEGER.

• htmlText(), which operates on STRING. The actual implementation of this
method will be static (the method does not appear in the actual class which it
'operates on' - java.lang.String) and it will take one argument - the object it
operates on, marked as this. Note that htmlText() operates on STRING, which
means it will also operate on an object of type ELEMENT or OBJECT. If we
had defined the class name to be java.lang.String, then calling
htmlText() on an ELEMENT would give us a type mismatch during compilation.
The 'SaveText' boolean flag is a compiler directive used to define whether the
text inside the scope of the element will be used and therefore needs to be saved.
The default value is false. This flag is an engine optimization; we don't
want to save the text for every element.

The two other methods are substring(int) and substring(int, int) defined in the
java.lang.String class. This example shows the advantage of using a descriptive mode
when defining Xcomp methods. No interface needs to be implemented and any Java
method can be used once it is declared.
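
The two kinds of declared method (an instance method such as java.lang.String.substring(int), and a static method that receives the object it operates on as its first argument) can be sketched with plain Java reflection. This is an assumption about one possible mechanism, not the generated code of the Xcomp compiler; MethodBinding and HtmlTextUtil are hypothetical names, and HtmlTextUtil only imitates the behaviour described in the htmlText comment above.

```java
import java.lang.reflect.Method;

/** Hypothetical static helper standing in for a method declared with (this). */
final class HtmlTextUtil {
    private HtmlTextUtil() {
    }

    /** Trims the text and replaces every whitespace sequence with one space. */
    public static String htmlText(String self) {
        return self.trim().replaceAll("\\s+", " ");
    }
}

/** Hypothetical link between a declared signature and the Java method behind it. */
final class MethodBinding {
    private final Method method;
    private final boolean targetIsFirstArgument;

    private MethodBinding(Method method, boolean targetIsFirstArgument) {
        this.method = method;
        this.targetIsFirstArgument = targetIsFirstArgument;
    }

    /** Binds an ordinary instance method, e.g. java.lang.String.substring(int). */
    static MethodBinding instanceMethod(Class<?> owner, String name, Class<?>... types) {
        return new MethodBinding(lookup(owner, name, types), false);
    }

    /** Binds a static method whose first parameter is the object it operates on. */
    static MethodBinding staticOnTarget(Class<?> owner, String name, Class<?>... types) {
        return new MethodBinding(lookup(owner, name, types), true);
    }

    Object invoke(Object target, Object... args) {
        try {
            if (!targetIsFirstArgument) {
                return method.invoke(target, args);
            }
            Object[] all = new Object[args.length + 1];
            all[0] = target;                       // the object the method "operates on"
            System.arraycopy(args, 0, all, 1, args.length);
            return method.invoke(null, all);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    private static Method lookup(Class<?> owner, String name, Class<?>... types) {
        try {
            return owner.getMethod(name, types);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

For example, MethodBinding.instanceMethod(String.class, "substring", int.class).invoke("hello", 1) returns "ello", and MethodBinding.staticOnTarget(HtmlTextUtil.class, "htmlText", String.class).invoke("  a \n b  ") returns "a b".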

4.3.2 Xcomp Set of Queries 

Using the Xcomp configuration, we can define one or more queries per page. All those
queries are unrelated but are processed at the same time on the same stream of events.
This makes integration with the site easier for the query programmer. In some pages one
may want to look for two unrelated pieces of data (like a list of results AND the link to
the next page of results). In some cases, one can also define queries which are valid for
all the pages of a site and then just add other queries specific to a given page. This is
particularly easy when the importing capability is used (see section 4.3.4).
4.3.3 Xcomp Filters

By using the query object as an org.xml.sax.XMLFilter and not only as an
org.xml.sax.ContentHandler, one can chain queries and pass the results of one query as
the event input for the second query. In some cases this can prove a very powerful
capability. Specifically, it enables us to save state during our query processing.
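
The chaining idea can be sketched with the standard SAX helper classes. The example below is hand-written rather than generated from Xcomp, and its class names (TableOnlyFilter, LinkCollector, FilterChainDemo) are hypothetical: the first stage keeps a small piece of state and forwards only events inside table elements, and the second stage queries whatever events it is given.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** First stage: forwards only element events that occur inside <table> elements. */
final class TableOnlyFilter extends XMLFilterImpl {
    private int tableDepth = 0;   // state kept between events

    TableOnlyFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (qName.equals("table")) {
            tableDepth++;
        }
        if (tableDepth > 0) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (tableDepth > 0) {
            super.endElement(uri, localName, qName);
        }
        if (qName.equals("table")) {
            tableDepth--;
        }
    }
}

/** Second stage: collects the href of every <a> element it is shown. */
final class LinkCollector extends DefaultHandler {
    final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("a")) {
            hrefs.add(atts.getValue("href"));
        }
    }
}

final class FilterChainDemo {
    private FilterChainDemo() {
    }

    /** Parses xml through the filter and returns the links found inside tables. */
    static List<String> linksInsideTables(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TableOnlyFilter filter = new TableOnlyFilter(reader);
            LinkCollector collector = new LinkCollector();
            filter.setContentHandler(collector);
            filter.parse(new InputSource(new StringReader(xml)));
            return collector.hrefs;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the second stage only sees the events the first stage forwards, a link outside any table never reaches it.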

4.3.4 Import Statements 

The Xcomp configuration allows us to import Xcomp files from other Xcomp files. Using
it, one can declare frequently used methods in a separate file and then import that file,
or define general queries for the whole site and then import them.

4.3.5 Framework Configuration from Xcomp 

The Xcomp configuration file is also used to define some framework configurations.
This is optional, as the framework does not require any configuration, but if the
programmer requests a specific variant of a parser or wants to override the content
parser searching method, she can do so from within the Xcomp file by declaring the
content parser by name.

4.3.6 Compiler 

Our Xcomp compiler reads the Xcomp file, parses the queries it contains and generates
Java classes for each query, plus the Pagent that controls all those query objects, the
parser and the protocol handler. The query classes are the Xcomp engines for a particular
query. This compilation phase with its configuration (Xcomp) files is the only
connection between the language implementation and the framework. See Appendix II
for the BNF Grammar of the Xcomp file.

4.3.7 Engine 

The Xcomp engine implementation is a group of methods to handle specific events and
a data structure to maintain the state between those events. The engine contains an
event handling method for the start and end of every tag relevant to the query. There
is no 'main' method for the query processing; the engine only acts in reaction to events.
This makes it well suited to use with SAX. The query processing is managed by the state
kept on the query object. This state specifically defines what the value of every path is.

Whenever there is an event that closes a tag which results in a value for a range expression
variable, the engine will evaluate all conditions and all variable values, and will fire a
result if there is a need to.
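
To make the event-driven shape concrete, here is a hand-written sketch of what an engine for the query { x.href | x::=//a } might look like. It is an illustration under assumptions, not output of the Xcomp compiler: the class name HrefQueryEngine is hypothetical, and results are collected in a list instead of being fired to a listener.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hand-written sketch of an event-driven engine for { x.href | x::=//a }. */
final class HrefQueryEngine extends DefaultHandler {
    // State kept between events: the href of every <a> that is currently open.
    // LinkedList is used because it accepts null (an <a> without an href).
    private final Deque<String> pendingHrefs = new LinkedList<>();
    private final List<String> results = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("a")) {
            pendingHrefs.push(atts.getValue("href"));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("a")) {
            // The scope of the range variable x has closed: fire one result.
            results.add(pendingHrefs.pop());
        }
    }

    /** Runs the query over an XML string and returns the results in firing order. */
    static List<String> run(String xml) {
        try {
            HrefQueryEngine engine = new HrefQueryEngine();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), engine);
            return engine.results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

There is no main loop: the parser drives the handler, and each result becomes available as soon as the corresponding element closes, without the whole document being held in memory.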
Appendix I

Xcomp expression BNF Grammar (in JavaCC syntax)

SKIP : /* WHITE SPACE */
{
  " "
| "\t"
| "\n"
| "\r"
| "\f"
}

/* SEPARATORS */

TOKEN :
{
  < LPAREN: "(" >
| < RPAREN: ")" >
| < LBRACE: "{" >
| < RBRACE: "}" >
| < LBRACKET: "[" >
| < RBRACKET: "]" >
| < SEMICOLON: ";" >
| < COMMA: "," >
| < STAR: "*" >
| < SLASH: "/" >
}

/* OPERATORS */

TOKEN :
{
  < GT: ">" >
| < LT: "<" >
| < EQ: "==" >
| < LE: "<=" >
| < GE: ">=" >
| < LE2: "=<" >
| < GE2: "=>" >
| < NE: "!=" >
| < ICEQ: >
| < MATCH: "MATCH" >
| < CONTAINS: "CONTAINS" >
| < ICMATCH: "-MATCH" >
| < ICCONTAINS: "-CONTAINS" >
| < PLUS: "+" >
| < MINUS: "-" >
| < MOD: "%" >
}



/* BOOLEAN OPERATORS */

TOKEN :
{
  <AND: "AND" | "and" | "And">
| <OR: "OR" | "or" | "Or">
| <NOT: "NOT" | "not" | "Not">
}

/* BOOLEAN VALUES */

TOKEN :
{
  <TRUE: "TRUE" | "true" | "True">
| <FALSE: "FALSE" | "false" | "False">
}

/* XPATH AXIS */

TOKEN :
{
  <ANCESTOR: "ancestor">
| <ANCESTOR_OR_SELF: "ancestor-or-self">
| <CHILD: "child">
| <DESCENDANT: "descendant">
| <DESCENDANT_OR_SELF: "descendant-or-self">
| <FOLLOWING: "following">
| <FOLLOWING_SIBLING: "following-sibling">
| <PARENT: "parent">
| <PRECEDING: "preceding">
| <PRECEDING_SIBLING: "preceding-sibling">
| <SELF: "self">
}


TOKEN : /* identifiers and numbers */
{
  <NUMBER: (<DIGIT>)+ >
| <NULL: "NULL">
| <PARAM: "param">
| <DOT: ".">
| <IDENTIFIER: <LETTER> (<LETTER> | <DIGIT>)* >
| < #LETTER:
    [
      "\u0041"-"\u005a",
      "\u0061"-"\u007a",
      "\u00c0"-"\u00d6",
      "\u00d8"-"\u00f6",
      "\u00f8"-"\u00ff",
      "\u0100"-"\u1fff",
      "\u3040"-"\u318f",
      "\u3300"-"\u337f",
      "\u3400"-"\u3d2d",
      "\u4e00"-"\u9fff",
      "\uf900"-"\ufaff"
    ]
  >
| < #DIGIT:
    [
      "\u0030"-"\u0039",
      "\u0660"-"\u0669",
      "\u06f0"-"\u06f9",
      "\u0966"-"\u096f",
      "\u09e6"-"\u09ef",
      "\u0a66"-"\u0a6f",
      "\u0ae6"-"\u0aef",
      "\u0b66"-"\u0b6f",
      "\u0be7"-"\u0bef",
      "\u0c66"-"\u0c6f",
      "\u0ce6"-"\u0cef",
      "\u0d66"-"\u0d6f",
      "\u0e50"-"\u0e59",
      "\u0ed0"-"\u0ed9",
      "\u1040"-"\u1049"
    ]
  >
}


/* XCOMP SPECIFIC */

TOKEN :
{
  <IC: "-"> // The - sign signals 'ignore case'
| <SELECT_WHERE_SEP: "|">
| <ASSIGN: ":=">
| <RANGE_EXPR: "::=">
| <AXIS_SEP: "::">
| <STRING_CONSTANT:
    "\""
    (   (~["\"","\\","\n","\r"])
      | ("\\"
          ( ["n","t","b","r","f","\\","'","\""]
          | ["0"-"7"] ( ["0"-"7"] )?
          | ["0"-"3"] ["0"-"7"] ["0"-"7"]
          )
        )
    )*
    "\""
  >
}
start() :
{
  <LBRACE> select() <SELECT_WHERE_SEP> where() <RBRACE> <EOF>
}

select() :
{
  scalarVal()
| <LBRACKET> varlist() <RBRACKET>
}

varlist() :
{
  scalarVal()
  ( <COMMA> scalarVal() )*
}

varAppearance() :
{
  <IDENTIFIER>
}

elementExpr() :
{
  <DOT> ( memberName() ( methodParams() )? )?
}

memberName() :
{
  <IDENTIFIER>
}

methodParams() :
{
  <LPAREN> ( scalarVal()
  ( <COMMA> scalarVal() )* )? <RPAREN>
}
where() :
{
}
{
  whereElement() ( <COMMA> whereElement() )*
}

whereElement() :
{
  LOOKAHEAD(2)
  assignment() |
  LOOKAHEAD(2)
  rangeExpr() |
  cond()
}

cond() :
{
  (   <LPAREN> cond() <RPAREN>
    | <NOT> cond()
    | globalAtomicExpression()
  )
  ( LOOKAHEAD(1) (<AND> | <OR>) condNotRecursive() )*
}

condNotRecursive() :
{
  (   <LPAREN> cond() <RPAREN>
    | <NOT> condNotRecursive()
    | globalAtomicExpression()
  )
}






assignment() :
{
  varName() <ASSIGN> ( LOOKAHEAD(2) pathExpr() | varAppearance() )
  ( elementExpr() )?
}

rangeExpr() :
{
  varName() <RANGE_EXPR> pathExpr()
}

pathExpr() :
{
  ( varPathExpr() )?
  ( pathExprSep() pathElement() )+
}

varPathExpr() :
{
  ( axis() )? varAppearance()
}

pathExprSep() :
{
  <SLASH> ( <SLASH> )?
}

pathElement() :
{
  ( axis() )? ( Any() | <IDENTIFIER> )
  ( <LBRACKET> elementCond() <RBRACKET> )?
}

varName() :
{
  <IDENTIFIER>
}

axis() :
{
  (   <ANCESTOR>
    | <ANCESTOR_OR_SELF>
    | <CHILD>
    | <DESCENDANT>
    | <DESCENDANT_OR_SELF>
    | <FOLLOWING>
    | <FOLLOWING_SIBLING>
    | <PARENT>
    | <PRECEDING>
    | <PRECEDING_SIBLING>
    | <SELF>
  )
  <AXIS_SEP>
}

scalarVal() :
{
  (   varAppearance() ( elementExpr() )?
    | parameter()
    | <STRING_CONSTANT>
    | <NUMBER>
    | <NULL>
    | <TRUE>
    | <FALSE>
  )
}



innerElementCond() :
{
  compositeExpression()
}

elementCond() :
{
  innerElementCond()
  ( <COMMA> innerElementCond() )*
}

parameter() :
{
  <PARAM> <LPAREN> <STRING_CONSTANT> <RPAREN>
}

compositeExpression() :
{
  (   <LPAREN> compositeExpression() <RPAREN>
    | <NOT> compositeExpression()
    | atomicExpression()
  )
  ( LOOKAHEAD(1) (<AND> | <OR>) compositeExpression() )*
}

atomicExpression() :
{
  (   LOOKAHEAD(2)
      elementExpr()
      (   <GT>
        | <LT>
        | <EQ>
        | <LE>
        | <GE>
        | <LE2>
        | <GE2>
        | <NE>
        | <ICEQ>
        | <MATCH>
        | <CONTAINS>
        | <ICMATCH>
        | <ICCONTAINS>
        | <PLUS>
        | <MINUS>
        | Mul()
        | Div()
        | <MOD>
      )
      ( scalarVal() | elementExpr() )
    | globalAtomicExpression()
  )
}

globalAtomicExpression() :
{
  scalarVal()
  (   LOOKAHEAD(globalAtomicExpressionOp())
      globalAtomicExpressionOp()
      scalarVal()
    |
  )
}

globalAtomicExpressionOp() :
{
  (   <GT>
    | <LT>
    | <EQ>
    | <LE>
    | <GE>
    | <LE2>
    | <GE2>
    | <NE>
    | <ICEQ>
    | <MATCH>
    | <CONTAINS>
    | <ICMATCH>
    | <ICCONTAINS>
    | <PLUS>
    | <MINUS>
    | Mul()
    | Div()
    | <MOD>
  )
}



Any() :
{
  <STAR>
}

Mul() :
{
  <STAR>
}

Div() :
{
  <SLASH>
}

STRING_CONSTANT:
  "\""
  (   (~["\"","\\","\n","\r"])
    | ("\\"
        ( ["n","t","b","r","f","\\","'","\""]
        | ["0"-"7"] ( ["0"-"7"] )?
        | ["0"-"3"] ["0"-"7"] ["0"-"7"]
        )
      )
  )*
  "\""
Appendix II

Xcomp configuration file BNF Grammar (in JavaCC syntax)

SKIP : /* WHITE SPACE */
{
  " "
| "\t"
| "\n"
| "\r"
| "\f"
}

/* COMMENTS */

MORE :
{
  "//" : IN_SINGLE_LINE_COMMENT
|
  <"/**" ~["/"]> : IN_FORMAL_COMMENT
|
  "/*" : IN_MULTI_LINE_COMMENT
}

<IN_SINGLE_LINE_COMMENT>
SPECIAL_TOKEN :
{
  <SINGLE_LINE_COMMENT: "\n" | "\r" | "\r\n" > : DEFAULT
}

<IN_FORMAL_COMMENT>
SPECIAL_TOKEN :
{
  <FORMAL_COMMENT: "*/" > : DEFAULT
}

<IN_MULTI_LINE_COMMENT>
SPECIAL_TOKEN :
{
  <MULTI_LINE_COMMENT: "*/" > : DEFAULT
}

<IN_SINGLE_LINE_COMMENT,IN_FORMAL_COMMENT,IN_MULTI_LINE_COMMENT>
MORE :
{
  < ~[] >
}

TOKEN : /* PARSER HEADER */
{
  <PARSER_HEADER: "ContentParser">
| <IMPORT_HEADER: "Import">
| <FINAL_XCOMP_HEADER: "Query">
| <FILTER_XCOMP_HEADER: "Filter">
| <XCOMP_HEADER: "Xcomp">
| <CLASS_HEADER: "Class">
| <METHOD_SIG_HEADER: "Signature">
| <METHOD_LOCATION_HEADER: "Location">
| <METHOD_RETURN_TYPE_HEADER: "ReturnType">
}

TOKEN :
{
  <COLON: ":" >
| <SEMICOLON: ";" >
| <COMMA: "," >
| <RPAREN: "(" >
| <LPAREN: ")" >
| <RBRACE: "}" >
| <LBRACE: "{" >
}

TOKEN :
{
  < XCOMP_EXPR: <LBRACE> <INSIDE_BLOCK> <RBRACE> >
| < #INSIDE_BLOCK: (~["}"])* ( ("\\}") (~["}"]) )* (~["}"])* >
}
TOKEN :
{
  < XCOMP_TEMPLATE: <XCOMP_TEMPLATE_OPEN> (~[])* <XCOMP_TEMPLATE_CLOSE> >
| < #XCOMP_TEMPLATE_OPEN: "<" <ELEMENT_NAME> ">" >
| < #XCOMP_TEMPLATE_CLOSE: "</" <ELEMENT_NAME> ">" >
| < #ELEMENT_NAME: "xcomp_results" >
}



TOKEN : /* IDENTIFIERS */
{
  < BOOLEAN_VALUE: "true" | "false" >
| < NUMBER: (<DIGIT>)+ >
| < NAME: (<IDENTIFIER>)+ ( (<SEP> | <DOT> | <ARRAY_DEF>)? (<IDENTIFIER>)* )* >
| < #IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
| < #DOT: "." >
| < #ARRAY_DEF: "[]" >
| < #SEP: "/" >
| < #LETTER:
    [
      "\u0024",
      "\u0041"-"\u005a",
      "\u005f",
      "\u0061"-"\u007a",
      "\u00c0"-"\u00d6",
      "\u00d8"-"\u00f6",
      "\u00f8"-"\u00ff",
      "\u0100"-"\u1fff",
      "\u3040"-"\u318f",
      "\u3300"-"\u337f",
      "\u3400"-"\u3d2d",
      "\u4e00"-"\u9fff",
      "\uf900"-"\ufaff"
    ]
  >
| < #DIGIT:
    [
      "\u0030"-"\u0039",
      "\u0660"-"\u0669",
      "\u06f0"-"\u06f9",
      "\u0966"-"\u096f",
      "\u09e6"-"\u09ef",
      "\u0a66"-"\u0a6f",
      "\u0ae6"-"\u0aef",
      "\u0b66"-"\u0b6f",
      "\u0be7"-"\u0bef",
      "\u0c66"-"\u0c6f",
      "\u0ce6"-"\u0cef",
      "\u0d66"-"\u0d6f",
      "\u0e50"-"\u0e59",
      "\u0ed0"-"\u0ed9",
      "\u1040"-"\u1049"
    ]
  >
}


start() :
{
  ( ImportExpr() )*
  ( ContentParser() )?
  ( XcompMethodDecl() )*
  ( XcompClassExpr() )* <EOF>
}

ImportExpr() :
{
  <IMPORT_HEADER> Name() <SEMICOLON>
}

ContentParser() :
{
  <PARSER_HEADER> Name() <SEMICOLON>
}
XcompMethodDecl() :
{
  <CLASS_HEADER> Name() <SEMICOLON> ( XcompClassMethodData() )*
}

XcompClassMethodData() :
{
  <METHOD_SIG_HEADER> MethodSignature()
  <METHOD_LOCATION_HEADER> MethodSignature()
  <METHOD_RETURN_TYPE_HEADER> Name() ( BooleanFlag() )* <SEMICOLON>
}

BooleanFlag() :
{
  Name() <BOOLEAN_VALUE>
}

MethodSignature() :
{
  Name() <RPAREN>
  ( ( LOOKAHEAD(2) Name() <COMMA> )* Name() )? <LPAREN>
}

XcompClassExpr() :
{
  <XCOMP_HEADER> <NUMBER>
  ( XcompFilterExpr() )*
  FinalXcompClassExpr() <SEMICOLON>
}

XcompFilterExpr() :
{
  <FILTER_XCOMP_HEADER>
  ( XcompXmlTemplate() )?
  XcompExpr()
}

XcompXmlTemplate() :
{
  <XCOMP_TEMPLATE>
}

FinalXcompClassExpr() :
{
  <FINAL_XCOMP_HEADER>
  XcompExpr()
}

XcompExpr() :
{
  <XCOMP_EXPR>
}

Name() :
{
  <NAME>
}

Appendix III

Example for HTML Parser Scope Rules

No | Element Name | No Scope Tag | Closes Scope Of | Close by Next Appearance
0  | ISINDEX      |   |   |
1  | BASE         | * |   |
2  | META         | * |   |
3  | LINK         |   |   |
4  | HR           |   |   |
5  | BR           |   |   |
6  | INPUT        |   |   |
7  | IMG          |   |   |
8  | PARAM        | * |   |
9  | BASEFONT     | * |   |
10 | AREA         |   |   |
11 | NEXTID       | * |   |
12 | RT           |   |   |
13 | EMBED        |   |   |
14 | KEYGEN       | * |   |
15 | SPACER       | * |   |
16 | WBR          |   |   |
17 | FRAME        | * |   |
18 | BGSOUND      | * |   |
19 | DT           |   |   | + DD
20 | DD           |   |   | + DT
21 | THEAD        |   | TR, TH |
22 | FIELDSET     |   | OPTION, SELECT, LEGEND, LABEL, BUTTON |
23 | TR           |   | TD, TH | *
24 | TD           |   |   | + TH
25 | TH           |   |   | + TD
26 | CAPTION      |   |   |
27 | HTML         |   | all |
28 | HEAD         |   | all |
29 | BODY         |   | all | *
30 | TABLE        |   | TR, TD, TH, CAPTION, COLGROUP, COL, THEAD, TBODY, TFOOT |
31 | UL           |   | LI |
32 | OL           |   | LI |
33 | DL           |   | DD, DT |
34 | DIR          |   | LI |
35 | MENU         |   | LI |
36 | SELECT       |   | OPTION |
37 | TBODY        |   | TR, TD |
38 | FORM         |   | OPTION, SELECT, FIELDSET, LEGEND, LABEL, BUTTON |
39 | LEGEND       |   |   |
40 | COLGROUP     |   | COL |
41 | COL          |   |   |
42 | TFOOT        |   | TR, TD |
43 | LI           |   |   | *
44 | OPTION       |   |   |
45 | LABEL        |   |   |
46 | BUTTON       |   |   |
47 | NOFRAMES     |   | all |
48 | IFRAME       |   | all |
49 | ILAYER       |   | all |
50 | LAYER        |   | all |
51 | NOLAYER      |   | all |
52 | NOEMBED      |   | all |
53 | NOSCRIPT     |   | all |
54 | INS          |   | all |
55 | DEL          |   | all |
56 | P            |   |   | *
57 | A            |   |   | *
CLAIMS

1. A web interaction system which enables a mobile telephone to interact with web
resources, in which the web interaction system comprises a query engine which
operates on XML format data obtained from content data extracted from a web
site, the query engine parsing the XML format data into SAX events which are
then queried by the query engine.

2. The system of Claim 1 in which querying the SAX events is achieved using an
object oriented XML query language with an event stream based engine.

3. The system of Claim 1 in which the XML format data which the query engine
operates on is obtained, either directly or indirectly via a translation engine, by an
automated agent from content data relevant to goods or services to be purchased
using the mobile telephone.

4. The system of Claim 3 in which the web site provides the content data in XML,
HTML or Javascript format.

5. The system of Claim 4 in which the web site provides content data in non-XML
format data, which is translated into valid XML using a translation engine by the
web interaction system.

6. The system of Claim 5 in which the translation engine can fully define the nesting
semantics needed for efficient and valid XML.

7. The system of Claim 1 in which the web interaction system uses an extensible
plug-in framework which allows plug-in components to be readily added to the
framework.



8. The system of Claim 7 in which the plug-ins cover different parsers, support for
different protocols or different query languages.

9. The system of Claim 1 which uses business logic defined by a mobile telephone
operator to prioritise or filter search results according to predefined rules.

10. The system of Claim 1 in which the system automatically interrogates web based 
resources from multiple suppliers to allow a user of the mobile telephone to 
compare similar goods or services from different suppliers without those suppliers 
needing to provide wireless protocol specific data. 

11. The system of Claim 1 which automates user defined processes, enabling the user 
to delegate tasks to the system without the need for continued real time connection 
to the Internet.

12. The system of Claim 1 which can be modified by user defined preferences or
profiles.

13. The system of Claim 1 which can supply data records defining the details of the 
process used by customers to look for goods or services to purchase. 

14. A method of enabling a mobile telephone to interact with web resources, in which 
the method comprises the steps of:

(a) extracting content data from a web site according to an instruction sent
from the mobile telephone;

(b) obtaining XML format data from the content data; 

(c) parsing the XML format data into SAX events; 

(d) querying the SAX events using a query engine to generate query results;

(e) providing a response to the instruction sent from the mobile telephone
using the query result.



15. The method of Claim 14 in which querying the SAX events is achieved using an
object oriented XML query language and an event stream query engine.

16. The method of Claim 14 in which the XML format data is obtained, either directly
or indirectly via a translation engine, by an automated agent from content data
relevant to goods or services to be purchased using the mobile telephone. 

17. The method of Claim 16 in which the web site provides the content data in XML, 
HTML or Javascript format. 

18. The method of Claim 14 in which the web site provides content data in non-XML 
format data, which is translated into valid XML using a translation engine. 



19. The method of Claim 18 in which the translation engine can fully define the
nesting semantics needed for efficient and valid XML. 
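
For illustration only, steps (b) to (d) of Claim 14 might be sketched as follows. This is a minimal sketch under stated assumptions, not the query engine or query language of the specification: it uses Python's standard xml.sax module, and the names PathQueryHandler and query_xml, the sample catalogue data and the simple "element path" query are all hypothetical, introduced only to show XML format data being parsed into SAX events and those events being queried as a stream.

```python
import xml.sax


class PathQueryHandler(xml.sax.ContentHandler):
    """Hypothetical event-stream query: collects the text of every element
    whose path from the root equals `target_path`."""

    def __init__(self, target_path):
        super().__init__()
        self.target_path = target_path  # e.g. ["catalogue", "item", "price"]
        self.stack = []                 # names of the currently open elements
        self.buffer = None              # text pieces of an open matching element
        self.results = []

    def startElement(self, name, attrs):
        # SAX event: an element has been opened.
        self.stack.append(name)
        if self.stack == self.target_path:
            self.buffer = []

    def characters(self, content):
        # SAX event: character data; kept only while inside a matching element.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        # SAX event: an element has been closed.
        if self.buffer is not None and self.stack == self.target_path:
            self.results.append("".join(self.buffer).strip())
            self.buffer = None
        self.stack.pop()


def query_xml(xml_text, target_path):
    """Parse XML text into SAX events and query them in a single pass."""
    handler = PathQueryHandler(target_path)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.results


# Assumed sample data standing in for XML obtained from a web site.
SAMPLE = """<catalogue>
  <item><title>CD A</title><price>9.99</price></item>
  <item><title>CD B</title><price>12.50</price></item>
</catalogue>"""

prices = query_xml(SAMPLE, ["catalogue", "item", "price"])
print(prices)
```

Because the handler reacts to events as they arrive, no in-memory document tree is built; the query results are accumulated during the single parsing pass.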
Figure 1



SUBSTITUTE SHEET (RULE 26) 
(Figure drawing labels: Protocol Handler Factory; Agent; Environment; URL; Protocol Handler; Query Results; Parser Factory)